Intelligent Paper Notes

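Colosseum is an open-access, publicly available, large-scale wireless testbed for experimental research with virtualized and softwarized waveforms and protocol stacks on a fully programmable "white-box" platform. Through 256 state-of-the-art software-defined radios and a massive channel emulator core, Colosseum can model virtually any scenario, enabling the design, development, and testing of solutions at scale under a variety of deployments and channel conditions. These Colosseum radio-frequency scenarios are reproduced through high-fidelity FPGA-based emulation with finite impulse response (FIR) filters. The filters model the taps of the desired wireless channels and apply them to the signals generated by the radio nodes, faithfully mimicking the conditions of real-world wireless environments. In this paper, we introduce Colosseum as a testbed that is, for the first time, open to the research community. We describe the architecture of Colosseum and its experimentation and emulation capabilities. We then demonstrate the effectiveness of Colosseum for experimental research at scale through exemplary use cases, including prevailing wireless technologies (e.g., cellular and Wi-Fi) in spectrum-sharing and unmanned aerial vehicle scenarios. A roadmap for future Colosseum updates concludes the paper.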

Translated by Google Translate
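The FIR-based channel emulation described in the Colosseum abstract can be illustrated with a minimal sketch. It assumes a complex baseband signal and a small, made-up set of channel taps; the function name `apply_channel_taps` and the tap values are hypothetical and are not taken from the paper or from Colosseum's FPGA implementation.

```python
import numpy as np

def apply_channel_taps(tx_signal, taps):
    """Apply FIR channel taps to a transmitted baseband signal.

    Implements y[n] = sum_k h[k] * x[n - k], truncated to the input length.
    """
    tx_signal = np.asarray(tx_signal, dtype=complex)
    taps = np.asarray(taps, dtype=complex)
    return np.convolve(tx_signal, taps)[: len(tx_signal)]

# Hypothetical two-path channel: a direct path plus a weaker,
# phase-rotated echo delayed by two samples (illustrative values only).
taps = [1.0, 0.0, 0.5j]

# A unit impulse as the transmitted signal, so the received signal
# reproduces the channel taps themselves.
tx = [1.0, 0.0, 0.0, 0.0]
rx = apply_channel_taps(tx, taps)
print(rx)
```

Feeding a unit impulse through the filter is a convenient sanity check, because the output is the channel impulse response. A real emulator would update the taps over time to follow node mobility and would run the filtering in hardware rather than in NumPy.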

AfroLM: A Self-Active Learning-based Multilingual Pretrained Language Model for 23 African Languages

Bonaventure F. P. Dossou, Atnafu Lambebo Tonja, Oreen Yousuf, Salomey Osei, Abigail Oppong, Iyanuoluwa Shode, Oluwabusayo Olufunke Awoyomi, Chris Chinenye Emezue

Categories: Natural Language Processing | Artificial Intelligence | Machine Learning

2022-11-07

In recent years, multilingual pre-trained language models have gained prominence due to their remarkable performance on numerous downstream Natural Language Processing (NLP) tasks. However, pre-training these large multilingual language models requires large amounts of training data, which are not available for African languages. Active learning is a semi-supervised learning algorithm in which a model consistently and dynamically learns to identify the most beneficial samples to train itself on, in order to achieve better optimization and performance on downstream tasks. Furthermore, active learning effectively and practically addresses real-world data scarcity. Despite all its benefits, active learning has received little consideration in the context of NLP, and especially in the pre-training of multilingual language models. In this paper, we present AfroLM, a multilingual language model pretrained from scratch on 23 African languages (the largest effort to date) using our novel self-active learning framework. Pretrained on a dataset significantly (14x) smaller than those of existing baselines, AfroLM outperforms many multilingual pretrained language models (AfriBERTa, XLMR-base, mBERT) on various downstream NLP tasks (NER, text classification, and sentiment analysis). Additional out-of-domain sentiment analysis experiments show that AfroLM generalizes well across various domains. We release the source code and the datasets used in our framework at https://github.com/bonaventuredossou/MLM_AL.
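The idea of a model identifying the most beneficial samples to train on can be sketched with a generic selection step. The example below uses entropy-based uncertainty sampling, a common active learning heuristic; it is an assumption made for illustration and is not the self-active learning framework of AfroLM, whose details the abstract does not give. The function name `select_most_uncertain` and the probabilities are hypothetical.

```python
import numpy as np

def select_most_uncertain(probs, k):
    """Return indices of the k samples with the highest predictive entropy."""
    probs = np.asarray(probs, dtype=float)
    # Small constant avoids log(0) for fully confident predictions.
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    # Sort by descending entropy and keep the top k.
    return np.argsort(-entropy)[:k]

# Hypothetical class probabilities predicted for three unlabeled samples.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction, low entropy
    [0.34, 0.33, 0.33],  # near-uniform prediction, highest entropy
    [0.70, 0.20, 0.10],  # moderately uncertain prediction
]

picked = select_most_uncertain(pool_probs, k=2)
print(picked)
```

In a full loop, the selected samples would be added to the training set, the model retrained, and the selection repeated on the remaining pool.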


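Recent advances in language model pre-training leverage large-scale datasets to create multilingual models. However, low-resource languages are mostly left out of these datasets. This is primarily because many widely spoken languages are not well represented on the web and are therefore excluded from the large-scale crawls used to create the datasets. Furthermore, downstream users of these models are restricted to the languages originally chosen for pre-training. This work investigates how to best leverage existing pre-trained models to create low-resource translation systems for 16 African languages. We focus on two questions: 1) How can pre-trained models be used for languages not included in the initial pre-training? 2) How can the resulting translation models be transferred effectively to new domains? To answer these questions, we create a new African news corpus covering 16 languages, eight of which are not part of any existing evaluation dataset. We demonstrate that the most effective strategy for transferring both to additional languages and to additional domains is to fine-tune large pre-trained models on small quantities of high-quality translation data.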


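Sentiment analysis is one of the most widely studied applications in NLP, but most work focuses on languages with large amounts of data. We introduce the first large-scale human-annotated Twitter sentiment dataset for the four most widely spoken languages in Nigeria (Hausa, Igbo, Nigerian-Pidgin, and Yorùbá), consisting of around 30,000 annotated tweets per language (and 14,000 for Nigerian-Pidgin), including a significant fraction of code-mixed tweets. We propose text collection, filtering, processing, and labeling methods that enable us to create datasets for these low-resource languages. We evaluate a range of pre-trained models and transfer strategies on the dataset. We find that language-specific models and language-adaptive fine-tuning generally perform best. We release the datasets, trained models, sentiment lexicons, and code to incentivize research on sentiment analysis in under-represented languages.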
